exclude fsdp from delay_optimizer_creation #34140
Conversation
Nice :) Can we add a test in tests/test_trainer.py? We can set env variables to configure Accelerate properly (ACCELERATE_MIXED_PRECISION="fp8" will auto-use TE).
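A minimal sketch of what such a test could look like, assuming the env-var route described above. The tiny checkpoint, test name, and fixtures are illustrative, not the actual test added in this PR, and running it for real needs multiple FP8-capable GPUs:

```python
import os

# Hypothetical sketch: configure Accelerate via env vars before building the Trainer.
# ACCELERATE_MIXED_PRECISION="fp8" makes Accelerate use TransformerEngine automatically;
# ACCELERATE_USE_FSDP="true" turns on FSDP.
os.environ["ACCELERATE_MIXED_PRECISION"] = "fp8"
os.environ["ACCELERATE_USE_FSDP"] = "true"

from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments


def test_fsdp_fp8_training(tmp_path, train_dataset):  # fixtures assumed provided by the test suite
    model = AutoModelForSequenceClassification.from_pretrained("hf-internal-testing/tiny-random-bert")
    args = TrainingArguments(output_dir=str(tmp_path), per_device_train_batch_size=1, max_steps=2)
    trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
    trainer.train()  # should not raise with FSDP + FP8 after this PR
```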
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
The required tests are distributed tests. We need to verify FSDP functionality both with and without FP8 mixed precision. The appropriate test file might be
Is the CI expected to run these tests?
@eljandoubi we can't run them on the normal CI since GPU runners are not part of PRs. Instead, when ready, I'll pull the PR down and run it myself.
@muellerzr Thank you for the information. I have tested the branch in my code on a multi-node, multi-GPU setup using FSDP mode, both with and without FP8 mixed precision, and it worked as expected. Please let me know if you encounter any issues on your end.
* Fix FSDP Initialization for resume training
* Added init_fsdp function to work with dummy values
* Fix FSDP initialization for resuming training
* Added CUDA decorator for tests
* Added torch_gpu decorator to FSDP tests
* Fixup for failing code quality tests
@muellerzr Any updates regarding this PR?
Thanks! Can you do pip install -e .[quality] followed by make fixup? I'll then pull it locally to test on my 4090 system and we should be set!
@muellerzr I have done pip install -e .[quality] followed by make fixup.
Don't worry, we'll merge as is; the failing tests are unrelated!
* exclude fsdp from delay_optimizer_creation
* add test case for trainer: FSDP mode and fp8 as mixed precision
* rearrange imports
* ruff formatted
* adapt _init_fsdp to fp8
* use _init_fsdp only when resume_from_checkpoint
* In case of FSDP, self.layer will be CheckpointWrapper which has no len() method
* delete _init_fsdp
* solve conflict
* fix conflict
* make fixup
What does this PR do?
It excludes FSDP from delay_optimizer_creation and passes the model and the optimizer to accelerate.prepare together, so that FP8 mixed precision can be enabled when requested.
Fixes #34024
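Roughly, the idea can be illustrated with plain Accelerate; this is a hedged sketch under stated assumptions, not the Trainer's exact code, and the stand-in model is only for illustration:

```python
# Sketch of the underlying idea: when FP8 mixed precision is requested, the
# optimizer is created up front and passed to accelerator.prepare together with
# the model, instead of delaying optimizer creation until after the model has
# been wrapped (as is done for other FSDP paths).
import torch
from accelerate import Accelerator

model = torch.nn.Linear(8, 8)                             # stand-in model for illustration
accelerator = Accelerator(mixed_precision="fp8")          # needs TransformerEngine + FP8-capable GPUs
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model, optimizer = accelerator.prepare(model, optimizer)  # wrapped in one call so FP8 hooks cover both
```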
Who can review?
Library: